Abstract. During the process of citation matching, links from bibliography
entries to referenced publications are created. Such links are indicators
of topical similarity between linked texts, are used in assessing the
impact of the referenced document and improve navigation in the user
interfaces of digital libraries. In this paper we present a citation
matching method and show how to scale it up to handle large amounts of
data using appropriate indexing and the MapReduce paradigm in the Hadoop
environment.
1 Introduction

Since Hitchcock et al. [1] demonstrated a proof-of-concept system that performed
autonomous linking within the Cognitive Science Open Journal, the problem of
citation matching (i.e. linking citation strings referencing the same paper) has
been tackled in numerous papers by means of various methods [2].

The huge interest in citation resolution is not surprising, as it is a
fundamental step in creating a digital library of scholarly publications. Knowing
that document A references document B makes it possible to provide more
user-friendly interfaces [3], perform scientometric analysis [4] and apply
link-based classification [5], [6].

Considering the rapid growth of the number of scientific publications, we need
to seek new ways of dealing with large amounts of data. Recently, the MapReduce
paradigm [7] and Apache Hadoop, its open-source implementation, have been
gaining popularity. MapReduce has already been used for entity matching by
Paradies et al. [8].

In this paper we present our own approach to citation matching in the Hadoop
environment. We start by demonstrating a small scale matching method in
Section 2. An author similarity measure is presented in Section 2.2. A very
important part in the scalability of our solution is played by an approximate
index for heuristic matching, which is described in Section 3. Finally,
Section 4 gives some details of index building and of the actual citation
matching implementation utilising the MapReduce paradigm.



2 Small scale disambiguation 

First of all, let us look at how reference disambiguation is performed when
working on a small scale. As we do not need to deal with large amounts of data
here, we can afford to compute the similarity between every pair of citation
strings. Having pairwise similarities, we can apply any clustering algorithm. In
our tests, we have used a basic single-link algorithm.

To begin with, we need to extract metadata from a given reference string. We
describe that in Section 2.1. However, metadata extracted from references may be
inaccurate or malformed. We have therefore developed measures of fuzzy match,
which are described in Section 2.2. From the match factors of particular
metadata fields, we need to draw conclusions about the whole citation. An
SVM [9] is employed for that task. Finally, the accuracy evaluation is presented
in Section 2.3.

2.1 Citation parsing 

One of the citation matcher's requirements is access to the metadata of input
citations. Unfortunately, in some cases the matcher has to deal with citations in
the form of raw strings. In such situations citations need to be preprocessed in
order to extract the required metadata. This is done by a citation parser, whose
role is to identify fragments of the input citation string containing meaningful
pieces of metadata. The information we extract at this stage includes: an
author, a title, a journal name, pages and a year of publication.

The parsing is performed in several steps. First, a reference string is tokenised 
into a list of tokens, each of which is in one of the following forms: a string con- 
taining only letters, a string containing only digits, a string containing letters 
and digits, a single other character. After that the parser computes a set of fea- 
ture values describing each token. The tokens represented by vectors of features 
are then classified into several categories that correspond to metadata fields. 
The token classifier is the heart of the citation parser. The classifier is based
on Conditional Random Fields and is built on top of the GRMM and MALLET
packages [10].

We use 42 features to describe a token: 

- features based on the presence of a particular character class, e.g. digits,
lowercase/uppercase letters, Roman numerals,

- features checking if the token is a particular character (e.g. a dot, a square
bracket, a comma or a dash),

- features checking if the token is a particular word,

- features checking whether the token is contained in a dictionary built from
the dataset, e.g. a dictionary of cities or of words commonly appearing in
journal titles.

It is worth noting that the class of a citation token depends not only on its
feature vector, but also on the surrounding tokens. To make the classifier aware
of this dependency, each feature vector is extended by adding the features of
the two preceding and two following tokens.

The citation parser is a part of CERMINE, a metadata and content extraction
tool [11].

Fig. 1. A citation string, its tokens and token classes.



2.2 Metadata fields matching 

Citations not only contain spelling errors, but also differ in style, which
leads e.g. to different conventions for abbreviating journal names. That is why
even matching parsed citations is not a trivial task. To address that, we
introduce similarity measures fitted to the specifics of the various metadata
fields. The overall similarity of two citation strings is obtained by applying a
linear SVM using the field similarities as features.



Definitions Let trigrams(s) be a multiset of trigrams [12] from a string s. We
define a trigram similarity between strings s and t as

$$\mathrm{sim}_{\mathrm{trigram}}(s, t) = \frac{2\,|\mathrm{trigrams}(s) \cap \mathrm{trigrams}(t)|}{|\mathrm{trigrams}(s) \cup \mathrm{trigrams}(t)|}$$

Token similarity is defined in a similar manner. Let tokens(s) be a multiset
of tokens from a string s. Then

$$\mathrm{sim}_{\mathrm{token}}(s, t) = \frac{2\,|\mathrm{tokens}(s) \cap \mathrm{tokens}(t)|}{|\mathrm{tokens}(s) \cup \mathrm{tokens}(t)|}$$
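
The two definitions above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the union of multisets is read as the
additive union (multiplicities are summed), which keeps the value in [0, 1], and
tokens are obtained by splitting on whitespace. The function names are
hypothetical and do not come from the described system.

```python
from collections import Counter


def trigrams(s):
    """Multiset of character trigrams of s."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def multiset_similarity(a, b):
    """2 * |a intersect b| / |a union b|, with an additive multiset union (assumption)."""
    intersection = sum((a & b).values())
    union = sum(a.values()) + sum(b.values())
    return 2 * intersection / union if union else 0.0


def trigram_similarity(s, t):
    return multiset_similarity(trigrams(s), trigrams(t))


def token_similarity(s, t):
    # Whitespace tokenisation is an assumption made for this sketch.
    return multiset_similarity(Counter(s.split()), Counter(t.split()))
```

Under this reading, identical strings score 1 and strings with no common trigram
score 0; token similarity ignores token order by construction.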



Author matching Author names in a citation string can take various forms.
In some cases given names are abbreviated; sometimes they are placed before,
sometimes after the surname. Additional titles may also be added. We have
developed two ways of measuring author field similarity. The first, most basic
one computes token and trigram similarity.

The second way of defining similarity is much more sophisticated. It is based
on finding the heaviest matching of tokens, which can be seen as an instance of
the assignment problem [13]. Each token can match at most one token in the other
citation. Each pair of matched tokens has a weight assigned. The weight of a
whole matching is the sum of the weights of the matched token pairs. The weight
of a token pair is determined by the token similarity and the relative distance
of the tokens in the whole string. The distance component allows us to take into
account the ordering of authors. It also lets us distinguish between papers
authored by John Smith and Jane Doe and those written by Jane Smith and John
Doe. The weights are assigned in a way that guarantees that the weight of the
matching lies in [0,1], so it can be treated as a similarity. The token
similarity is measured in terms of edit distance [14], with just one exception:
a pair of tokens in which one token, of length at most 2, is a prefix of the
other is assumed to have edit distance equal to 1.
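
The exception for short prefix tokens can be illustrated with a plain
Levenshtein distance. This is a minimal sketch, not the authors'
implementation: the function names are hypothetical, and leaving two identical
short tokens at distance 0 is an assumption, since the text does not say whether
the rule applies to equal tokens.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def author_token_distance(a, b):
    """Edit distance with the short-prefix exception described above."""
    short, long_ = sorted((a, b), key=len)
    # A token of length 1 or 2 that is a proper prefix of the other token
    # (e.g. an initial) is treated as being at distance 1.
    if short != long_ and 0 < len(short) <= 2 and long_.startswith(short):
        return 1
    return edit_distance(a, b)
```

With this rule the initial 'j' is as close to 'john' as 'smyth' is to 'smith',
instead of being three edits away.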

Computation of the distance is more complex. First of all, we normalise the
length of the whole citation to 1. Next, we align fully matching tokens of the
two strings to have the same position, which we set to the average of their
previous positions. Such tokens are called boundaries. The beginning and the end
of a string form additional boundaries. Note that the distance between the
tokens of a pair forming a boundary is equal to 0. The position of every other
token is defined relative to its closest boundaries. An example is provided in
Fig. 2.

[Figure 2: two token sequences, 'AAA BBB CCC DDE' (top) and 'ABB CCC DDD EEE'
(bottom), drawn one above the other on a common axis.]



Fig. 2. Token distances. The length of citation is normalised to 1. Perfectly matching 
tokens are aligned to have distance equal 0. All other tokens are positioned relatively 
to them. 



Source matching Journal names are abbreviated inconsistently. Usually only a
prefix of a word is used ('appl.' instead of 'applied') or some letters are
omitted ('journal' becomes 'jrnl'). That is why we have decided to base source
similarity on the LCS (longest common subsequence). We compute the
character-based LCS of the two strings, divide its length by the length of the
shorter string and treat the result as the similarity.
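
The source similarity just described can be sketched with the standard dynamic
programme for the longest common subsequence. The function names are
hypothetical, and returning 0 for an empty string is an assumption made to
avoid division by zero.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def source_similarity(a, b):
    """LCS length divided by the length of the shorter string."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0
```

Both abbreviation styles mentioned above score 1: 'jrnl' and 'appl' are
subsequences of 'journal' and 'applied' respectively.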

Title matching The title of an article is not always present in a citation string. 
When it is, however, we usually need to deal only with spelling errors. That is 
why a trigram similarity is used in this case. 

Year matching All numbers tagged as year are extracted from a citation, their
numeric values are computed and, for each citation, the number closest to 2000
is chosen. The final similarity is a binary value indicating whether these
numbers are equal.



Pages matching We extract all numbers tagged as pages, create a set of them 
for each citation and then compute the ratio of their intersection and sum. 
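
The year and pages features can be sketched as follows. The inputs are assumed
to be lists of integers already tagged by the parser, the function names are
hypothetical, and "sum" is read here as the set union, which is an
interpretation rather than something the text states explicitly.

```python
def pick_year(years, pivot=2000):
    """Choose the tagged number closest to the pivot year, or None if there is none."""
    return min(years, key=lambda y: abs(y - pivot)) if years else None


def year_similarity(years_a, years_b):
    """Binary feature: 1.0 if both citations yield the same year."""
    ya, yb = pick_year(years_a), pick_year(years_b)
    return 1.0 if ya is not None and ya == yb else 0.0


def pages_similarity(pages_a, pages_b):
    """Size of the intersection of page numbers over the size of their union."""
    a, b = set(pages_a), set(pages_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Choosing the number closest to 2000 guards against stray numbers that were
mistagged as a year.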

Whole-string matching Additionally three features based on trigram simi- 
larity of the whole citation string are used. One of them compares unmodified 
strings and the other two transformed ones: the first containing only letters and 
the second only digits from the original strings. 

2.3 Evaluation 

The evaluation was conducted on the CORA-ref data set [15]. It contains several
citation clusters (each cluster consisting of citations referencing the same
article). We have randomly distributed the clusters into 3 slices (see Tab. 1)
and used them to perform cross validation. Half of the training set for each
fold was used for the CRF parser training and the other half was used for the
SVM training.

Table 1. Cluster slices. Some slice statistics are presented along with their usage in 
particular cross validation phases. 





                    slice0   slice1   slice2
Cluster No.             80       89       79
Citation No.           505      596      774
Avg. cluster size     6.31     6.70     9.80
Max cluster size        33      115      121
Parser training      fold0    fold1    fold2
Matcher training     fold2    fold0    fold1
Matcher testing      fold1    fold2    fold0



Then, we have computed pairwise similarities between citation strings. The 
results were binarised by setting the similarity threshold at 50%. In order to 
obtain clusters, a single-link algorithm was applied. 
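
Thresholding followed by single-link clustering amounts to finding connected
components of the graph whose edges are the citation pairs above the threshold.
A minimal sketch using a union-find structure is given below. The function
name is hypothetical, `similarity` stands for any pairwise measure returning a
value in [0, 1], and treating a similarity exactly equal to the threshold as a
link is an assumption.

```python
from itertools import combinations


def single_link_clusters(items, similarity, threshold=0.5):
    """Group items into connected components of the thresholded similarity graph."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic loop over all pairs: acceptable only at small scale.
    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())
```

Note that two citations may end up in one cluster without being directly
similar, as long as a chain of similar citations connects them.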

We have performed tests for both versions of the author similarity measure (see
Section 2.2). All the results are presented in Tables 2 and 3. The following
metrics have been used:

- cluster recall - the percentage of correct clusters that were recovered by a
matcher

- pairwise precision - the percentage of links returned by a matcher that
are correct

- pairwise recall - the percentage of correct links that were returned by a
matcher

- pairwise F1 - the harmonic mean of pairwise precision and pairwise recall

As we see, the results are close to those reported in [16] . We can also notice that 
the choice of author similarity measure only slightly impacts them. 
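
The metrics listed above can be computed from two clusterings as in the sketch
below. It assumes that citations are represented by distinct hashable
identifiers and that a "link" is an unordered pair of citations placed in the
same cluster; the function names are hypothetical.

```python
from itertools import combinations


def links(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(pair) for c in clusters for pair in combinations(c, 2)}


def pairwise_scores(predicted, gold):
    """Return pairwise (precision, recall, F1) of predicted clusters against gold ones."""
    pred, true = links(predicted), links(gold)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cluster_recall(predicted, gold):
    """Fraction of gold clusters that were reproduced exactly."""
    found = {frozenset(c) for c in predicted}
    return sum(frozenset(c) in found for c in gold) / len(gold)
```

Cluster recall is the strictest of these measures, since a single missing or
spurious citation makes the whole cluster count as not recovered.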



Table 2. Matching results with complex author similarity measure. 





                      fold0    fold1    fold2     avg.
cluster recall       65.82%   72.50%   77.53%   71.95%
pairwise precision   95.21%   97.51%   94.98%   95.90%
pairwise recall      93.91%   93.06%   97.43%   94.80%
pairwise F1          94.56%   95.23%   96.19%   95.33%



Table 3. Matching results with simple author similarity measure. 





                      fold0    fold1    fold2     avg.
cluster recall       67.09%   76.25%   77.53%   73.62%
pairwise precision   94.81%   97.45%   94.66%   95.64%
pairwise recall      93.03%   92.76%   97.60%   94.46%
pairwise F1          93.91%   95.05%   96.11%   95.02%



3 Heuristic: author indexing 

The main scalability issue in the presented solution is its quadratic runtime.
We would therefore like to limit the number of necessary pairwise comparisons.
The standard approach used in entity disambiguation is called blocking. The
whole set of objects is divided into blocks so that entities are compared
pairwise and possibly merged only within a block (cf. [17], [18], [19]). We have
used a different method, though.

Bear in mind that the main focus of our system is a very specific type of
entity disambiguation: linking citation strings to article metadata stored in a
database. That means each citation will be matched to at most one metadata
record and records in the store will not be merged. Having observed the above,
we have decided to use a heuristic based on indexing.

Using an index, we retrieve the metadata records that have the largest number
of author tokens in common with the examined citation string. Author tokens are
those describing an author's given name or surname. They were chosen for our
heuristic because we have noticed that they are the most reliably parsed part of
a citation string. The remainder of this section describes the index in more
detail.



3.1 Non-exact matches 

Spelling errors commonly occur in citation strings. That is why an index
supporting non-exact matches was desired. We have decided to implement the ideas
presented by Manning et al. in Chapter 3 of [20] to support retrieval of tokens
within edit distance at most 1. Let us now present this method.

Instead of using an exact word w as a key, we insert all the rotations of w$
(where $ is a character not present in the alphabet). For example, instead of
the key 'cat', the keys 'cat$', 'at$c', 't$ca' and '$cat' are created. Then, to
retrieve a word from the index, we create all the rotations of the query in the
same manner and, for each rotation r of length n, return all the keys of length
at most n that match at least the first n - 1 letters of r and all the keys of
length at most n + 1 that match its first n letters.

For instance, to look up the word 'cut', we would create the rotations 'cut$',
'ut$c', 't$cu' and '$cut'. The first three letters of 't$cu' match 't$ca', so
this key would be retrieved. Similarly, to look up 'at', we would have the
rotation 'at$', which would match 'at$c'.

The whole process can be formalised in the following steps: 

1. Generate all rotations for a token. 

2. For each rotation r find all matching rotations in the index: 

(a) Let r = bc, where |c| = 1

(b) Find in the index the lexicographically smallest token t such that b is
a prefix of t

(c) Scan the following index entries to retrieve all words beginning with b of
max length |r| and all beginning with r of max length |r| + 1, together with
their document IDs

3. Flatten document ID lists and convert them to a set. 
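
The steps above can be sketched with a sorted list standing in for the index.
This is an in-memory illustration only: the function names and the
representation of the index as sorted (key, document ID) pairs are assumptions
of the sketch, not the MapFile-based structure described in Section 4.

```python
from bisect import bisect_left

END = "$"  # terminator assumed not to occur in any token


def rotations(token):
    """All rotations of token + '$'."""
    w = token + END
    return [w[i:] + w[:i] for i in range(len(w))]


def build_index(doc_tokens):
    """doc_tokens maps a document ID to its author tokens; returns sorted (key, doc_id) pairs."""
    return sorted({(rot, doc_id)
                   for doc_id, tokens in doc_tokens.items()
                   for token in tokens
                   for rot in rotations(token)})


def scan(index, prefix, max_len):
    """Yield document IDs of keys that start with prefix and have at most max_len characters."""
    i = bisect_left(index, (prefix,))  # first entry whose key is >= prefix
    while i < len(index) and index[i][0].startswith(prefix):
        key, doc_id = index[i]
        if len(key) <= max_len:
            yield doc_id
        i += 1


def lookup(index, token):
    """Document IDs reachable from token by the rotation-based approximate match."""
    docs = set()
    for r in rotations(token):
        docs.update(scan(index, r[:-1], len(r)))   # keys beginning with b, length <= |r|
        docs.update(scan(index, r, len(r) + 1))    # keys beginning with r, length <= |r| + 1
    return docs
```

For example, with an index built from `{1: ["cat"], 2: ["dog"]}`, both
`lookup(index, "cut")` and `lookup(index, "at")` return `{1}`, mirroring the
'cut' and 'at' examples given earlier in this section.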

3.2 Heuristic summary

To conclude, let us present the whole heuristic matching process: 

1. Extract all author tokens from the citation 

2. For each token, retrieve all matching documents 

3. For each document retrieved, compute the number of matching tokens 

4. Filter out documents containing fewer than max(1, M - 1) matching authors,
where M = the maximum number of matching authors

5. All remaining documents heuristically match the citation
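
The five steps above can be sketched as a single function. Here `lookup` is a
hypothetical callable that returns the set of document IDs matching one author
token (for instance an approximate index lookup), and counting one match per
token and document is an assumption about how "matching authors" are tallied.

```python
from collections import Counter


def heuristic_candidates(author_tokens, lookup):
    """Return the IDs of documents that heuristically match a citation."""
    counts = Counter()
    for token in author_tokens:
        for doc_id in lookup(token):
            counts[doc_id] += 1
    if not counts:
        return set()
    best = max(counts.values())      # M in the description above
    threshold = max(1, best - 1)
    return {doc_id for doc_id, n in counts.items() if n >= threshold}
```

Keeping documents with M - 1 matches as well as those with M leaves room for one
misparsed or misspelled author token.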

4 Hadoopisation 

Apache Hadoop is the most notable implementation of the MapReduce paradigm. In
this section we present how to implement the algorithms described above using
MapReduce and the Hadoop environment. To do that accurately, though, we need
to describe some technicalities first. Further technical details not covered in
this article are discussed in [21].

Hadoop SequenceFile is a binary file containing a list of key-value pairs. A 
SequenceFile with sorted keys enriched by additional index file enabling fast 
record retrieval is called a MapFile. This data structure is used to store our 
index. Additionally, we assume our input and output data will be stored as 
SequenceFiles. 



4.1 Index building 



Let us begin with the presentation of the index building process. The most
crucial part is presented in Fig. 3. Note that steps transforming one entry into
many can be implemented as map tasks (i.e. token extraction and rotation
generation) and those transforming many into one as reduce tasks (i.e.
grouping).

The process depicted generates all the necessary entries and stores them as a
SequenceFile. To sort it and transform it into a MapFile, Hadoop built-in
functions are used.

Fig. 3. Index building. The documents are read from a SequenceFile and all
author tokens (tokenised names) and document IDs are extracted from the
metadata. Then, document IDs with the same tokens are grouped. Eventually, for
each token, its rotations are generated and everything is persisted in a
SequenceFile.
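
The data flow of Fig. 3 can be mimicked in plain Python to show which steps are
map-like and which are reduce-like. This sketch does not use the Hadoop API: the
metadata layout (an "authors" list per document), the lowercasing and the
function names are all assumptions made for illustration.

```python
from collections import defaultdict


def map_tokens(doc_id, metadata):
    """Map step: one document in, many (token, doc_id) pairs out."""
    for name in metadata["authors"]:
        for token in name.lower().split():
            yield token, doc_id


def reduce_group(pairs):
    """Reduce step: group document IDs that share a token."""
    grouped = defaultdict(set)
    for token, doc_id in pairs:
        grouped[token].add(doc_id)
    return grouped


def map_rotations(token, doc_ids):
    """Map step: one token in, one index entry per rotation of token + '$' out."""
    w = token + "$"
    for i in range(len(w)):
        yield w[i:] + w[:i], sorted(doc_ids)


def build_index_entries(documents):
    """Run the three steps in memory and return the entries sorted by key."""
    pairs = (pair for doc_id, md in documents.items()
             for pair in map_tokens(doc_id, md))
    entries = [entry for token, ids in reduce_group(pairs).items()
               for entry in map_rotations(token, ids)]
    return sorted(entries)
```

The final `sorted` call plays the role of the sorting that turns the
SequenceFile into a MapFile.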



4.2 Actual matching 

Having built the index, we can proceed to the actual matching phase. It is
presented in Fig. 4. Here, the reference extraction and the heuristic matching
are done in map steps. The best matching document is chosen by selecting the one
with the highest similarity to the citation string, which is done as a reduce
step. The similarity is computed using the metrics defined in Section 2.2, with
the whole-string features omitted and the simple version of author matching.



[Figure 4: diagram with the labels Documents, References, candidate documents
and Best matching pairs.]

Fig. 4. Citation matching steps. The documents are read from an appropriate
SequenceFile and references are extracted from their metadata. Then, the actual
matching occurs: in the first step a heuristic is used to find documents that
may match citations, in the next the best match is selected for each citation.
Eventually, the results are persisted in a SequenceFile.
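
The selection of the best match can be sketched as a reduction over
(citation, candidate document) pairs. This is plain Python rather than Hadoop
code; `similarity` is a hypothetical scoring function, and keeping the first
candidate in case of a tie is an assumption.

```python
def select_best_matches(candidate_pairs, similarity):
    """For every citation keep the candidate document with the highest similarity."""
    best = {}
    for citation, document in candidate_pairs:
        score = similarity(citation, document)
        # Strict comparison keeps the earliest candidate when scores are equal.
        if citation not in best or score > best[citation][1]:
            best[citation] = (document, score)
    return best
```

In a MapReduce setting the citation would serve as the key, so that all
candidates of one citation meet in the same reduce call.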



4.3 Speed evaluation 

We have evaluated the efficiency of our solution using the PMC Open Access
Subset document set. It consists of over 450 thousand documents containing 12
million citations.

The benchmark was performed on our Hadoop cluster [22], which consists of four
"fat" slave nodes and a virtual machine on a separate physical machine in the
role of NameNode, JobTracker and HBase master. Each worker node has four AMD
Opteron 6174 processors (48 cores in total), 192 GB of RAM and four 600 GB disks
working in a RAID 5 array.

The time spent in each phase is presented in Table 4.



Table 4. The time spent in individual phases. 



                                                         Task No.
Phase            Step                       Time spent   Map   Reduce
Index building   All                        0:00:57       13        2
Matching         Citation extraction        0:00:46       13        -
                 Heuristic matching         3:01:38      745        -
                 Selecting the best match   2:51:00      996        1



5 Conclusions and future work 

In this paper we have presented an efficient citation matching solution using
Apache Hadoop. After developing a basic citation matching technique, we have
shown how to scale it up to handle millions of citations. In particular, we
have presented a way of creating an approximate index using the MapReduce
paradigm.

Big data opens up new possibilities, which need more research. It is especially
worth investigating how our model training phase may benefit from huge amounts
of data. On the other hand, when dealing with an enormous training set, many of
the examples may not be relevant, and we may want to select only the most
important ones. How to do that remains an open question that only further
investigation can answer.